QTM 447 Lecture 12: Image Classification

Kevin McAlister

February 20, 2025

Neural Networks

\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \newcommand{\ostar}{\mathbin{\mathpalette\make@circled*}} \]

Neural networks are a powerful framework for machine learning!

  • Start with an initial set of features

  • Create a mapping of the initial feature space to a \(K_1\) dimensional space where each dimension of the hidden space corresponds to some subfunction of the features

  • The subfunctions combine to represent the information in the original features

  • Just organized in a meaningful way that allows a classifier to better separate between classes

Neural Networks

Fully connected NNs have the following form:

Each layer takes an input, \(\mathbf x_i\), and translates the input to a \(K\) dimensional space through a linear mapping. For dimension \(k \in K\):

\[\underset{1 \times 1}{\mathbf q_{ik}} = \underset{1 \times K_{-1}}{\mathbf W_k^T} \underset{K_{-1} \times 1}{\mathbf x_i} + \underset{1 \times 1}{b_{k}}\]

Each value is then passed through an activation function to induce nonlinearity in the dimension mapping:

\[\underset{1 \times 1}{q_{ik}} = \underset{1 \times 1}{\max(0,q_{ik})}\]

Repeat this mapping for each hidden layer.

At the final step, create a linear classifier in the altered feature space!
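The layer computation above can be sketched in NumPy. This is a minimal illustration, not the notebook's actual model; the shapes (784 inputs, 128 hidden units) are assumptions matching the MNIST-style examples in this lecture:

```python
import numpy as np

# One fully connected hidden layer: linear map, then ReLU.
rng = np.random.default_rng(0)
P, K = 784, 128            # e.g., a flattened 28 x 28 image -> 128 hidden units
x = rng.normal(size=P)     # one input vector
W = rng.normal(size=(K, P)) * 0.01
b = np.zeros(K)

q = W @ x + b              # linear map: q_k = w_k^T x_i + b_k
h = np.maximum(0.0, q)     # ReLU activation induces the nonlinearity

print(h.shape)             # (128,)
```

Stacking several of these layers and ending with a linear classifier on the final `h` gives the fully connected network.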

Neural Networks

Can’t fully display the final 128-dimensional space

  • But, the 2D PCA shows pretty good separation!

Flattening works okay for this data set.

  • Generically, this isn’t ideal though!

  • Images have spatial information; the values of pixels in a neighborhood of the image are informative for each other!

Neural Networks

Let the hidden scores for \(i\) at a layer be denoted by \(\mathbf h_i \in \mathbb R^K\). For element \(k \in K\) in the first hidden layer, we can compute \(h_{ik}\) as:

\[h_{ik} = \varphi(\mathbf w_k^T \mathbf x_i)\]

  • This is an inner product - \(\sum \limits_{j = 1}^P w_{j,k} x_{ij}\)

For zero-centered values, this is proportional to the covariance between the two vectors!

And we’ll only see positive covariance because of the ReLU nonlinearity!
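The covariance connection is quick to verify numerically. A small sketch (random vectors, purely illustrative): for zero-centered vectors, the inner product equals \(N\) times the sample covariance:

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=100)
x = rng.normal(size=100)
w -= w.mean()                      # zero-center both vectors
x -= x.mean()

inner = w @ x                      # the inner product from the slide
cov = np.cov(w, x, ddof=0)[0, 1]   # sample covariance (1/N normalization)

# For zero-centered vectors: inner product = N * covariance
print(inner, len(w) * cov)
```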

Neural Networks

Each \(\mathbf w_k\) denotes a template for a part of the image.

Then computing the inner product of \(\mathbf x_i\) and \(\mathbf w_k\) shows the similarity between the \(i^{th}\) input and the template

  • If the similarity is big, large positive value

  • If the similarity is small, 0 or small positive value

Neural Networks

Each of these images has the three prong feature!

The problem is that our weight matrix does not exhibit translational invariance

  • The location of the pattern matters!

Neural Networks

How do we encode colors in an image?

  • Each pixel is a vector of 3 values between 0 and 255: Red (R), Green (G), and Blue (B)

  • Combined, we can create basically any color!

Demonstration

Neural Networks

RGB images present another problem with using FCNNs for analyzing images: number of parameters

Let’s say that we have 100k 28 x 28 images (relatively low resolution) with 3 color channels

  • Each image has 2352 features

Our entire data set now has over 235 million feature values!

  • The first layer with 128 hidden units would estimate 301 thousand coefficients!

  • That’s too many…

Neural Networks

To get around these problems, ML researchers introduced convolutional neural networks

  • The first few layers of a CNN are convolutional layers that divide the input into small overlapping image patches - collections of pixels in a small neighborhood.

  • Each image patch is then compared to \(K\) templates - small weight matrices (think \(3 \times 3\) or \(5 \times 5\)) that are learned from the data that represent different patterns within each image.

  • Each single template of \(5 \times 5\) weights is convolved with the input image - slid over all possible locations in the original 2D or 3D image space - to produce a new feature map after being passed through a ReLU transform.

Neural Networks

Key point 1:

Convolution operations will yield translational invariance

  • The hidden values will be the same regardless of where the pattern is matched!

This is key for image classification!

  • A cat is a cat regardless of where its ears are in the image!

Neural Networks

Key point 2:

A convolution operation has very few parameters compared to a FCNN!

For a single hidden unit with a \(28 \times 28 \times 3\) image, we need to compute 2352 weights!

For a \(5 \times 5 \times 3\) filter slid over the \(28 \times 28 \times 3\) image, we only have 75 parameters!

  • If convolution works well for pre-processing images for fully connected layers (e.g. splitting the images into the components that make a cat a cat), then we can save a lot of training time!
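The parameter counts on this slide are simple arithmetic, which we can check directly:

```python
# Parameter-count comparison from the slide (bias terms ignored):
fc_weights = 28 * 28 * 3          # one hidden unit fully connected to a 28x28x3 image
conv_weights = 5 * 5 * 3          # one 5x5 filter spanning all 3 color channels
print(fc_weights, conv_weights)   # 2352 75
```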

Convolutions

In the machine learning literature, the convolution operation is actually the cross-correlation operation.

  • We’re just going to call this convolution to stay in line with the literature

1D Convolution

\[[1,2,3,4,5,6] \circledast [1,2] = [5,8,11,14,17]\]

Convolutions

1D Convolution

\[[\boxed{\color{red}1,2},3,4,5,6] \circledast [\boxed{\color{red}1,2}] = [\boxed{\color{red}5}, 8, 11,14,17]\]

\[(1*1) + (2*2) = 5\]

Convolutions

1D Convolution

\[[1,\boxed{\color{red}2,3},4,5,6] \circledast [\boxed{\color{red}1,2}] = [5, \boxed{\color{red}8}, 11,14,17]\]

\[(2*1) + (3*2) = 8\]

Convolutions

1D Convolution

\[[1,2,\boxed{\color{red}3,4},5,6] \circledast [\boxed{\color{red}1,2}] = [5, 8, \boxed{\color{red}11},14,17]\]

\[(3*1) + (4*2) = 11\]

Convolutions

\[[1,2,3,4,5,6] \circledast [1,2] = [5,8,11,14,17]\]

  • The convolution of a 6-vector with a 2-vector yields a 5-vector

  • In general, a convolution of a larger vector with a smaller filter will result in a smaller output!
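The 1D example above is a one-liner to implement. A minimal sketch (the `conv1d_valid` helper name is ours, not a library function):

```python
import numpy as np

def conv1d_valid(x, w):
    """'Valid' 1D cross-correlation (what ML calls convolution):
    slide w over x and take the inner product at each position."""
    n, f = len(x), len(w)
    return np.array([x[i:i + f] @ w for i in range(n - f + 1)])

out = conv1d_valid(np.array([1, 2, 3, 4, 5, 6]), np.array([1, 2]))
print(out)  # [ 5  8 11 14 17]
```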

Convolutions

2D Convolution

Let’s consider a 3x3 matrix \(\mathbf A\) and a 2x2 weight matrix \(\mathbf W\).

\[\begin{bmatrix} \boxed{a_{11}} & \boxed{a_{12}} & a_{13} \\ \boxed{a_{21}} & \boxed{a_{22}} & a_{23} \\ a_{31} & a_{32} & a_{33} \\ \end{bmatrix} \circledast \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ \end{bmatrix} = \mathbf C\]

The top-left element of \(C\) is computed as follows:

\[ c_{11} = a_{11}w_{11} + a_{12}w_{12} + a_{21}w_{21} + a_{22}w_{22} \]

Convolutions

2D Convolution

Let’s consider a 3x3 matrix \(\mathbf A\) and a 2x2 weight matrix \(\mathbf W\).

\[\begin{bmatrix} a_{11} & \boxed{a_{12}} & \boxed{a_{13}} \\ a_{21} & \boxed{a_{22}} & \boxed{a_{23}} \\ a_{31} & a_{32} & a_{33} \\ \end{bmatrix} \circledast \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ \end{bmatrix} = \mathbf C\]

The next element of \(C\) is computed as follows:

\[ c_{12} = a_{12}w_{11} + a_{13}w_{12} + a_{22}w_{21} + a_{23}w_{22} \]

Convolutions

2D Convolution

Let’s consider a 3x3 matrix \(\mathbf A\) and a 2x2 weight matrix \(\mathbf W\).

\[\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ \boxed{a_{21}} & \boxed{a_{22}} & a_{23} \\ \boxed{a_{31}} & \boxed{a_{32}} & a_{33} \\ \end{bmatrix} \circledast \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ \end{bmatrix} = \mathbf C\]

The next element of \(C\) is computed as follows:

\[ c_{21} = a_{21}w_{11} + a_{22}w_{12} + a_{31}w_{21} + a_{32}w_{22} \]

Convolutions

2D Convolution

Let’s consider a 3x3 matrix \(\mathbf A\) and a 2x2 weight matrix \(\mathbf W\).

\[\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & \boxed{a_{22}} & \boxed{a_{23}} \\ a_{31} & \boxed{a_{32}} & \boxed{a_{33}} \\ \end{bmatrix} \circledast \begin{bmatrix} w_{11} & w_{12} \\ w_{21} & w_{22} \\ \end{bmatrix} = \mathbf C\]

The next element of \(C\) is computed as follows:

\[ c_{22} = a_{22}w_{11} + a_{23}w_{12} + a_{32}w_{21} + a_{33}w_{22} \]

Convolutions

All we’re doing is sliding the kernel matrix over the image to get some summary of the patches!

As we showed above, this is computing the cross correlation of the filter with the segment of the image!

A high value in the feature map means the inner product of the kernel with the image segment is high

  • High covariance between the filter and the patch

Tells us if the feature is present in the patch or not.
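The 2D sliding computation from the previous slides can be sketched directly (the `conv2d_valid` helper is ours; a simple all-ones filter just sums each patch):

```python
import numpy as np

def conv2d_valid(A, W):
    """'Valid' 2D cross-correlation: inner product of W with each
    overlapping patch of A."""
    nh, nw = A.shape
    fh, fw = W.shape
    C = np.zeros((nh - fh + 1, nw - fw + 1))
    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            C[i, j] = np.sum(A[i:i + fh, j:j + fw] * W)
    return C

A = np.arange(1, 10).reshape(3, 3)   # [[1,2,3],[4,5,6],[7,8,9]]
W = np.ones((2, 2))                  # each output entry sums a 2x2 patch
C = conv2d_valid(A, W)
print(C)
```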

Convolutions

Before we show this off, I should mention one thing about convolutions and the resulting size

In the 2D case with a \(3 \times 3\) image and a \(2 \times 2\) filter, we ended up with a \(2 \times 2\) convolution.

Try to find the following sizes yourself using only valid configurations:

  • \(4 \times 4\) image and a \(2 \times 2\) filter

  • \(4 \times 4\) image and a \(3 \times 3\) filter

  • \(4 \times 3\) image and a \(2 \times 2\) filter

  • Can you come up with the general rule?

Convolutions

For a \(n_h \times n_w\) image and a \(f_h \times f_w\) filter, we get a:

\[(n_h - f_h + 1) \times (n_w - f_w + 1)\]

convolution.

  • This is called a valid convolution

Convolutions

To yield a same convolution (the output is the same size as the input), we can use zero-padding to make the edges of the image wider without injecting new pixel information.

  • 0 corresponds to white in the pencil/paper case, so we’re just saying make the image a little bigger!

Convolutions

2D Convolution

Let’s consider a 5x5 matrix \(\mathbf A\) and a 3x3 weight matrix \(\mathbf W\).

\[\begin{bmatrix} \boxed{0} & \boxed{0} & \boxed{0} & 0 & 0 \\ \boxed{0} & \boxed{a_{11}} & \boxed{a_{12}} & a_{13} & 0 \\ \boxed{0} & \boxed{a_{21}} & \boxed{a_{22}} & a_{23} & 0 \\ 0 & a_{31} & a_{32} & a_{33} & 0 \\ 0 & 0 & 0 & 0 & 0 \\ \end{bmatrix} \circledast \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \\ \end{bmatrix} = \mathbf C\]

Convolutions

If we zero pad \(p_h\) rows to the top and bottom of the image and \(p_w\) columns to the left and right of the image, then the resulting convolution will have size:

\[(n_h + 2p_h - f_h + 1) \times (n_w + 2p_w - f_w + 1)\]

Assuming a square filter and setting \(p_h = p_w\), what should we set the pad width to such that the convolution is of the same size as the input image?

  • This is why we typically use an odd number for the square filter size!
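Working out the answer: setting \(p = (f-1)/2\) makes the output size equal \(n\), and \(p\) is only an integer when \(f\) is odd. A quick numerical check:

```python
def out_size(n, f, p):
    """Output length of a zero-padded 'valid' convolution along one dimension."""
    return n + 2 * p - f + 1

# Setting p = (f - 1) / 2 (an integer only for odd f) gives a 'same' convolution:
n = 28
for f in (3, 5, 7):
    p = (f - 1) // 2
    print(f, out_size(n, f, p))  # output size equals n = 28 in every case
```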

Convolutions

With same convolution and an odd-sized kernel, we can think of the resulting feature map as centering the kernel on each pixel of the image.

\[\begin{bmatrix} \boxed{0} & \boxed{0} & \boxed{0} & 0 & 0 \\ \boxed{0} & \boxed{a_{11}} & \boxed{a_{12}} & a_{13} & 0 \\ \boxed{0} & \boxed{a_{21}} & \boxed{a_{22}} & a_{23} & 0 \\ 0 & a_{31} & a_{32} & a_{33} & 0 \\ 0 & 0 & 0 & 0 & 0 \\ \end{bmatrix} \circledast \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \\ \end{bmatrix} = \mathbf C\]
  • Looking for the pattern given in the filter in the receptive field around each pixel

Convolutions

Let’s look back at some threes

Convolutions

Let’s create a \(5 \times 5\) filter to try and find the left-hand prongs of the threes:

\[ \begin{bmatrix} 0 & 0 & -1 & 0 & 0 \\ 0 & -1 & -2 & -1 & 0 \\ -1 & -2 & 16 & -2 & -1 \\ 0 & -1 & -2 & -1 & 0 \\ 0 & 0 & -1 & 0 & 0 \end{bmatrix} \]

Convolutions

Using one filter produces one feature map

  • A representation of the new feature space after a convolution with a filter

This one doesn’t work too well

  • It finds the edges of a three quite well!

  • It’s probably better if we try to learn the weights of the filter from the data!

What if we want more than 1 feature map?

  • We add more filters!

Convolutions

A convolutional layer is a layer in a DNN where we perform \(K\) convolutions with \(K\) filters

  • Most commonly of the same size, but don’t have to be!

Convolutions

A common small version of a CNN is called LeNet5Small

  • Let’s look in the notebook

Convolutions

Given a set of weights for the kernel, \(\mathbf W_k\), and the corresponding bias term, \(b_k\), we can compute the feature map for a given image as:

\[\mathbf W_k \circledast \mathbf X_i + b_k\]

  • Each convolutional layer can have multiple filters, so we can look at each feature map separately

  • 6 here

Convolutions

For grayscale images (1 color channel), the first layer of convolutional filters (after being passed through a ReLU activation) tends to learn oriented edges and outlines!

  • This is way more interpretable than the feature maps learned from FCNNs

And way fewer parameters!

The total convolution weight matrix is a \(5 \times 5 \times 6\) matrix

  • 150 parameters

Compare this to a fully connected layer with 6 hidden units

  • 4704 parameters

Convolutions

For each filter (of size \(5 \times 5\) in this case), there are learnable parameters that dictate the convolution

  • Choose parameters that describe important features within the images

  • Convolutions that describe most of the variation important in dictating the class of the image!

Convolutions

These parameters (convolution weights) are learned via backpropagation in the same way that other weights are learned!

  • The backprop update steps are similar to those for FCNNs

  • We aren’t going to go over these as it is largely redundant with our previous lectures

  • See here for a pretty good summary of how backprop works for the convolutional layers

Convolutions

Let’s look at the values for each “pixel” in the feature map for an image

Convolutions

The neighboring values are largely redundant

  • The values don’t change very much from neighbor to neighbor

  • Small drops

The reason for this is that the receptive field for each pixel when computing the convolution is largely shared neighbor to neighbor

Convolutions

Draw a picture on the board

  • Easier to draw than generate via Python!

Convolutions

One method of reducing this redundancy is to perform a strided convolution

  • Instead of using every pixel as a center for the filter, skip every \(s^{th}\) input

  • We’re still getting most of the information, just not doing every possible filter location!

An added benefit of strided convolution is that we shrink the size of the feature map for each image.

  • With same convolution via zero padding, we get a \(n_h \times n_w\) feature map.

  • For a \(f \times f\) filter, even zero-padding of \(p\) on each side, and even striding on the rows and columns (every \(s\) elements), we get:

\[\frac{n_h + 2p - f + s}{s} \times \frac{n_w + 2p - f + s}{s}\]
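The size formula is usually written with an explicit floor so it handles uneven fits; when the stride divides evenly it matches the expression above. A small check:

```python
def strided_out(n, f, p, s):
    """Output length with padding p and stride s.
    Floor division handles filter positions that don't fit evenly."""
    return (n + 2 * p - f) // s + 1

# With stride 1 this reduces to the padded 'valid' formula (n + 2p - f + 1);
# with stride s it matches floor((n + 2p - f + s) / s) from the slide.
print(strided_out(28, 5, 2, 1))  # 28 (same convolution)
print(strided_out(28, 5, 2, 2))  # 14 (stride 2 halves the feature map)
```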

Convolutions

Strided convolutions are faster than dense convolutions

  • Fewer evaluations to get the resulting feature map

Cost is loss of resolution in the feature map

  • Doesn’t matter for sufficiently high-res images!

Convolutions

With strided convolutions, we don’t really lose all that much!

  • The model computes quite a bit quicker since we’re computing fewer inner products

The cost is a little accuracy.

  • Less clarity in the feature map.

Striding can be really useful when the input images are big

  • When they’re not, striding doesn’t add that much.

Convolutions

The most popular way to deal with redundancies and reduce the dimensionality of the resulting feature map is to follow up each convolutional layer with a pooling layer

In each pooling layer, we look at each \(K \times K\) window of the feature map and return a single number summary within the filter.

Two common types:

  • Max pooling - return the maximum value within the window

  • Average Pooling - return the average value within the window

Convolutions

Max pooling:

\[\begin{bmatrix} \boxed{1} & \boxed{2} & 3 \\ \boxed{4} & \boxed{5} & 6 \\ 7 & 8 & 9\\ \end{bmatrix} = \begin{bmatrix} \boxed{5} & 6\\ 8 & 9\\ \end{bmatrix}\]

Convolutions

Max pooling takes the strongest signal within each window of the feature map

  • This introduces locational invariance to the process; it doesn’t matter where in the window we “hit”

Max pooling also reduces the dimensionality of the resulting feature map

  • Pooling is most commonly done with a window of size \(K \times K\) and a stride of \(s\)

  • When \(s = K\), we divide the image into non-overlapping partitions and take the pool

Note that there are no parameters here!

Convolutions

Suppose that our feature map is of size \(H \times W\)

  • We pool with a filter of size \(K \times K\) and stride \(s\)

The output is then a reduced feature map with size:

\[\left(\frac{H - K}{s} + 1\right) \times \left(\frac{W - K}{s} + 1\right)\]
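The earlier max-pooling example can be reproduced with a small sketch (the `max_pool` helper is ours; each output dimension is \((H-K)/s + 1\)):

```python
import numpy as np

def max_pool(A, K, s):
    """Max pooling with a K x K window and stride s: take the
    maximum value within each window of the feature map."""
    H, W = A.shape
    out_h = (H - K) // s + 1
    out_w = (W - K) // s + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = A[i * s:i * s + K, j * s:j * s + K].max()
    return out

A = np.arange(1, 10).reshape(3, 3)   # the slide's 3x3 example
P = max_pool(A, K=2, s=1)
print(P)                             # [[5. 6.] [8. 9.]]
```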

Convolutions

Let’s look back at (small) LeNet5

Convolutions

We can see that max pooling:

  • Reduces the size and resolution of each feature map

  • Accentuates the activated pixels

Pooling layers are critical for scalable implementations of CNNs!

Convolutions

After an initial convolution + pooling layer, each image has \(K\) feature maps of size \(H \times W\)

Given \(C\) pooled feature maps of size \(H \times W\) for \(N\) images, we can express the new set of “images” as a new image with \(C\) channels

\[C \times H \times W\]

  • A tensor

Convolutions

Why stop at 1 convolutional layer?

  • Create a convolution over the new “image”

  • Define a filter that looks at a \(K \times K\) patch in each feature map and across feature maps

  • Next filter will be \(C \times K \times K\)

  • The depth of a filter should always be equal to the number of input channels!

Seems more complicated, but it isn’t really!

  • Each new element of the new feature map is a weighted combination of the “pixels” in the receptive field within each feature map and across the collection of feature maps!
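A multi-channel filter can be sketched the same way as the 2D case, just with one extra sum over channels. The shapes here (6 input maps of size 8 × 8, a 3 × 3 kernel) are illustrative assumptions:

```python
import numpy as np

def conv2d_multichannel(X, W):
    """One filter of shape (C, K, K) applied to a (C, H, W) input:
    the filter depth matches the number of input channels, and the
    per-channel contributions are summed into a single feature map."""
    C, H, Wd = X.shape
    _, K, _ = W.shape
    out = np.zeros((H - K + 1, Wd - K + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(X[:, i:i + K, j:j + K] * W)
    return out

X = np.ones((6, 8, 8))              # e.g., 6 pooled feature maps of size 8x8
W = np.ones((6, 3, 3)) / (6 * 9)    # an averaging filter across all channels
fmap = conv2d_multichannel(X, W)
print(fmap.shape)                   # (6, 6)
```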

Convolutions

Really cool visualization of this process:

By Adam Harley

Convolutions

Modern CNN architectures have the following form:

  • Input images (1 or 3 color channels)

  • Convolution + ReLU + Pool (reduce the feature map dimensionality) a few times

  • Flatten (take the 2d image and flatten it to a row like we’ve done previously)

  • FCNN (1 or 2 layers, typically)

  • Output layer

Convolutions

LeNet-5 is the seminal CNN architecture designed for the MNIST handwritten digits (1998!!!!)

  • Each input image is a \(1 \times 28 \times 28\) collection of grayscale pixel values

Convolutions

A quick note on computational complexity:

The most expensive part of a CNN is the convolution part!

Let’s suppose that we have a single \(3 \times 32 \times 32\) input image and we want to convolve it with 10 filters of size \(3 \times 5 \times 5\) - each filter spans all three color channels!

Convolutions

We pad the image the have a same convolution:

  • Output volume: \(10 \times 32 \times 32\)

  • Number of parameters: \(10 \times 3 \times 5 \times 5 = 750\) weights plus 10 bias params

  • Sliding over the entire image, we get 1024 output locations per filter

  • Each location requires 75 multiply-adds

  • \(75 \times 1024 \times 10 = 768,000\) ops per layer!
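The cost accounting above is easy to verify:

```python
# Cost accounting for the slide's example: a 3 x 32 x 32 input,
# ten 3 x 5 x 5 filters, zero-padded to a 'same' convolution.
params = 10 * 3 * 5 * 5            # 750 weights (plus 10 bias params)
locations = 32 * 32                # 1024 output locations per filter
mults_per_location = 3 * 5 * 5     # 75 multiply-adds at each location
total_ops = mults_per_location * locations * 10
print(params, total_ops)           # 750 768000
```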

Convolutions

This adds up!

  • But, each operation is really simple

GPUs are key to making CNNs move at a respectable pace for modern sized image analysis problems!

Convolutions

Let’s look at the real LeNet5.

Convolutions

Subsample of 20,000 \(28 \times 28\) images in 10 classes

  • Takes around 20 seconds to train!

  • More operations, but each operation is simpler

  • Leads to faster compute time than the worse-performing FCNN!

Faster training convergence

Better validation accuracy!

Convolutions

Next time:

  • Modern CNN architecture tips and tricks
  • Transfer Learning